Abstract

We introduce a controlled form of recursion in XQuery, 
inflationary fixed points, familiar in the context of rela- 
tional databases. This imposes restrictions on the express- 
ible types of recursion, but we show that inflationary fixed 
points nevertheless are sufficiently versatile to capture a 
wide range of interesting use cases, including the semantics 
of Regular XPath and its core transitive closure construct. 

While the optimization of general user-defined recursive 
functions in XQuery appears elusive, we will describe how 
inflationary fixed points can be efficiently evaluated, provided
that the recursive XQuery expressions exhibit a distributivity
property. We show how distributivity can be assessed both
syntactically and algebraically, and provide experimental
evidence that XQuery processors can benefit substantially
during inflationary fixed point evaluation.



1. Introduction 

The backbone of the XML data model, namely ordered, 
unranked trees of nodes, is inherently recursive and it is nat- 
ural to equip the associated languages with constructs that 
can query such recursive structures. To get from the re- 
cursive axes in XPath, e.g., ancestor and descendant, 
to XQuery's recursive user-defined functions, language
designers took a giant leap, however. User-defined func- 
tions in XQuery admit arbitrary types of recursion — a con- 
struct that largely evades optimization approaches beyond 
"procedural" improvements like tail-recursion elimination 
or unfolding. 

This paper embarks on a journey that explores a controlled
form of recursion in XQuery, the inflationary fixed point (IFP),
familiar in the context of relational databases [1]. While this
imposes restrictions on the expressible types of recursion, IFP
embraces a family of widespread use cases of recursion in
XQuery, including many forms of horizontal or vertical
structural recursion and the pervasive transitive closure
problem (IFP captures Regular XPath [25], in particular).



<!ELEMENT curriculum (course)*>
<!ELEMENT course (prerequisites)>
<!ATTLIST course code ID #REQUIRED>
<!ELEMENT prerequisites (pre_code)*>
<!ELEMENT pre_code (#PCDATA)>

Figure 1. Curriculum data (simplified DTD). 



Example 1.1 The DTD of Figure 1 (taken from El) describes
recursive curriculum data, including courses, their lists of
prerequisite courses, the prerequisites of the latter, and so on.
The XQuery program of Figure 2 uses the course element node
with code "c1" to seed a computation that recursively finds all
prerequisite courses, direct or indirect, of course "c1". For a
given sequence $x of course nodes, function fix(·) calls out to
rec(·) to find their prerequisites. While new nodes are
encountered, fix(·) calls itself with the accumulated course
node sequence. (This is not expressible in XPath 2.0.)

Note that fix(·) implements a generic fixed point computation:
only the initialization (let $seed := ···) and the payload
function rec(·) are specific to the curriculum problem. This
motivates the introduction of a syntactic form that can
succinctly accommodate this pattern of computation
(Section 2).

Most importantly, however, such computation in IFP 
form is susceptible to systematic optimization, provided 
that the payload (or body) of the recursion exhibits a spe- 
cific distributivity property. 

Unlike the general user-defined XQuery functions, this
account of recursion puts the query processor into control
in that it can decide whether the optimization may be safely
applied. Distributivity may be assessed on a syntactical
level, a non-invasive approach that can easily be realized
on top of existing XQuery processors (Section 3). Further,
though, if we adopt a relational view of the XQuery semantics
(as in [15]), the seemingly XQuery-specific distributivity
notion turns out to be elegantly and uniformly tractable
on the familiar algebraic level (Section 4).

Compliance with the restriction that IFP imposes on 
query formulation is rewarded by significant query runtime 



 1 declare function rec ($cs) as node()*
 2 { $cs/id (./prerequisites/pre_code)
 3 };
 4
 5 declare function fix ($x) as node()*
 6 { let $res := rec ($x)
 7   return if (empty ($res except $x))
 8          then $x
 9          else fix ($res union $x)
10 };
11
12 let $seed := doc ("curriculum.xml")
13              /course[@code="c1"]
14 return fix (rec ($seed))

Figure 2. Prerequisites for the course "c1"
(lines 5 to 10 form the fixed point computation).



savings that the IFP-inherent optimization hook can offer.
We document the effect for the XQuery processors
MonetDB/XQuery [8] and Saxon [20] in Section 5. This is
primarily due to a substantial reduction of the number of items
that are fed into the recursion's payload function (the naive
implementation of Example 1.1 feeds already discovered
course element nodes back into rec(·)).

In Section 6 we stop by related work on recursion on the
XQuery as well as the relational side of the fence, and
finally wrap up in Section 7.

2. An Inflationary Fixed Point in XQuery 

The subsequent discussion will revolve around the recursion
pattern embodied by function fix(·) of Figure 2, known as the
inflationary fixed point (IFP) [1]. We will introduce a new
syntactic form that accommodates IFP on the XQuery language
level and then explore its semantics in the XQuery context,
its application, and its optimization.

In the following, we regard an XQuery expression e_1
containing a free variable $x as a function of $x. We write
e_1(e_2) to denote e_1[e_2/$x], i.e., the consistent replacement
of all free occurrences of $x in e_1 by e_2. Function fv(e)
returns the set of free variables of expression e. We further
introduce set-equality (=ₛ), a relaxed notion of equality for
XQuery item sequences that disregards duplicate items and
order, e.g., (1,"a") =ₛ ("a",1,1).

To streamline the discussion, in the following we assume
computations over sequences of type node()*, as trees are
the recursive data structure in the XQuery Data Model. In
this case, with X_1, X_2 of type node()*, we have¹

    X_1 =ₛ X_2   ⇔   fs:ddo(X_1) = fs:ddo(X_2) .



An extension to general sequences of type item()* is possible
and entails the replacement of XQuery's node set operations
(union, except) with appropriate variants.

Definition 2.1 (Inflationary Fixed Point) Let e_seed and
e_rec($x) be XQuery expressions of type node()*. The
inflationary fixed point (IFP) of e_rec($x) seeded by e_seed is
an XQuery expression represented by the following syntactic
form:

    with $x seeded by e_seed recurse e_rec($x) .        (1)

The payload expression e_rec is called the body, e_seed is
called the seed, and $x is called the recursion variable of
the inflationary fixed point operator.

The semantics of the IFP of e_rec($x) seeded by e_seed
is the sequence of nodes res_k, if it exists, obtained in the
following manner:

    res_0     ←  e_rec(e_seed)
    res_{i+1} ←  e_rec(res_i) union res_i ,    i ≥ 0

where k ≥ 1 is the minimum number for which res_k =
res_{k-1}. Otherwise, the IFP of e_rec($x) seeded by e_seed is
undefined.

¹ Here and in the following, fs:ddo(·) abbreviates the function
fs:distinct-doc-order(·) of the XQuery Formal Semantics [9].

Note that if expression e_rec does not invoke node constructors
(e.g., element {·} {·} or text {·}), such that the query
operates over a finite domain of nodes, IFP will always be
defined. Otherwise, the invocation of node constructors in
the recursion body might yield an infinite node domain and
IFP might be undefined.
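The iteration of Definition 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: node sequences are modelled as frozensets (which matches set-equality, since duplicates and order are disregarded), and the `PREREQS` mapping, the function names, and the `max_iter` guard are assumptions made for the example.

```python
from typing import Callable, FrozenSet, Hashable

Nodes = FrozenSet[Hashable]


def ifp(seed: Nodes, rec: Callable[[Nodes], Nodes], max_iter: int = 10_000) -> Nodes:
    """Inflationary fixed point of payload `rec` seeded by `seed`."""
    res = rec(seed)                      # res_0 <- e_rec(e_seed)
    for _ in range(max_iter):
        grown = rec(res) | res           # res_{i+1} <- e_rec(res_i) union res_i
        if grown == res:                 # res_k = res_{k-1}: fixed point reached
            return res
        res = grown
    # With node constructors in the payload the IFP may be undefined.
    raise RuntimeError("no fixed point within max_iter iterations")


# Hypothetical curriculum: course code -> codes of its direct prerequisites.
PREREQS = {"c1": {"c2", "c3"}, "c2": {"c4"}, "c3": {"c4"}, "c4": set()}


def rec(courses: Nodes) -> Nodes:
    """Payload: direct prerequisites of all given courses."""
    return frozenset(p for c in courses for p in PREREQS.get(c, ()))


result = ifp(frozenset({"c1"}), rec)
print(sorted(result))
```

Because the payload only selects existing keys of a finite mapping, the loop is guaranteed to terminate here; the guard only matters for payloads that keep producing new values.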

Example 2.2 In terms of the new with ··· seeded by ···
recurse syntactic form, we can now express the transitive
closure query from Example 1.1 in a concise and elegant
fashion:

    with $x seeded by doc("curriculum.xml")
                      /course[@code="c1"]              (Q1)
    recurse $x/id(./prerequisites/pre_code)

Obviously, the new form with ··· seeded by ··· recurse
is mere syntactic sugar as it can be equivalently expressed
via the recursive user-defined function template fix(·)
(lines 5 to 10 in Figure 2). Since the syntactic form is a
second-order construct taking an XQuery variable name and
two XQuery expressions as arguments, function fix(·) has to
be interpreted as a template in which the recursion body
rec(·) needs to be instantiated (XQuery 1.0 does not support
higher-order functions). Given this, Expression (1) is
equivalent to the expression
let $x := e_seed return fix(rec($x)).



Using IFP to Compute Transitive Closure. Much like in
the relational context, transitive closure is an archetype of
recursive computation over XML instances. Regular XPath
[25], for example, defines the transitive closure of XPath
location steps to obtain powerful primitives that express
horizontal and vertical structural recursion. We can naturally
extend this definition to any XQuery expression of type
node()*.

Definition 2.3 (Transitive Closure) Let e be an expression
of type node()*. The transitive closure e⁺ of e is

    e union e/e union e/e/e union ···                  (2)

if the resulting node sequence is finite. Otherwise, e⁺ is
undefined.



Given simple restrictions on e (see Section 3.1), with the new
IFP form e⁺ is ('.' denotes the context node):

    with $x seeded by . recurse $x/e .

IFP in SQL:1999. IFP has found its way into SQL in
terms of the WITH RECURSIVE clause introduced by the
ANSI/ISO SQL:1999 standard [21]. To exemplify, consider
the table C(course, prerequisite) as a relational
representation of the curriculum XML data (Figure 1). The
prerequisites P(course_code) of the course with code 'c1'
then are:

WITH RECURSIVE P(course_code) AS
  ((SELECT prerequisite                 -- seed
    FROM   C
    WHERE  course = 'c1')
   UNION ALL
   (SELECT C.prerequisite               -- body
    FROM   P, C
    WHERE  P.course_code = C.course))
SELECT DISTINCT * FROM P;

Analogous to the XQuery variant, table P is seeded with
the direct prerequisites of course 'c1' before the join with
table C in the body is iterated to also add all indirect
prerequisites until P does not grow further.

The SQL:1999 standard dictates quite rigid syntactical
restrictions for the WITH RECURSIVE form (the body, in
particular, must be linear: P may occur only once in its
FROM clause). We will return to this in Sections 3.2 and 6.
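To try the recursive query, the following sketch runs it against an in-memory SQLite database through Python's standard `sqlite3` module. The sample rows in table C are made up for illustration and are deliberately acyclic: with UNION ALL, cyclic prerequisite data would make the recursion run forever.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE C (course TEXT, prerequisite TEXT)")
# Hypothetical curriculum: c1 requires c2 and c3, both of which require c4.
conn.executemany(
    "INSERT INTO C VALUES (?, ?)",
    [("c1", "c2"), ("c1", "c3"), ("c2", "c4"), ("c3", "c4")],
)

QUERY = """
WITH RECURSIVE P(course_code) AS (
    SELECT prerequisite                 -- seed
    FROM   C
    WHERE  course = 'c1'
    UNION ALL
    SELECT C.prerequisite               -- body (linear: P occurs once)
    FROM   P, C
    WHERE  P.course_code = C.course
)
SELECT DISTINCT * FROM P
"""

prereqs = sorted(row[0] for row in conn.execute(QUERY))
print(prereqs)
```

Course c4 is reached twice (via c2 and via c3), so P contains it twice before the final SELECT DISTINCT removes the duplicate.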

2.1. Algorithms for IFP 

The semantics of the inflationary fixed point in XQuery,
i.e., the specification of the node sequence res_k of
Definition 2.1, can be straightforwardly turned into an iterative
algorithm to compute IFP. Figure 3(a) shows the resulting



res ← e_rec(e_seed);
do
  res ← e_rec(res) union res;
while res grows;

(a) Algorithm Naive

res ← e_rec(e_seed);
Δ ← res;
do
  Δ ← e_rec(Δ) except res;
  res ← Δ union res;
while res grows;

(b) Algorithm Delta

Figure 3. Algorithms to evaluate the IFP of
e_rec given e_seed. Result is res.



declare function delta ($x, $res) as node()*
{ let $delta := rec ($x) except $res
  return if (empty ($delta))
         then $res
         else delta ($delta, $delta union $res)
};

Figure 4. An XQuery formulation of Delta.



procedure, commonly referred to as Naive in the literature
[5]. In the do ··· while loop body, the procedure calls out
to the recursion's payload function e_rec(·) to determine the
next portion of nodes that will augment the current
intermediate result. Only if e_rec(·) cannot contribute new
nodes, the procedure returns the current res.

Since res grows, this feeds the same nodes over and over
again into e_rec(·). Depending on the nature of the payload,
e_rec(·)'s answer might include nodes which we have
seen before. Ultimately, Naive risks initiating a substantial
amount of redundant computation.

A now folklore variation of Naive is the Delta algorithm
[17] of Figure 3(b). In this variant, the payload is
invoked only for those nodes that have not been encountered
in earlier iterations: node sequence Δ is the difference
between e_rec(·)'s last answer and the current result res. In
general, e_rec(·) will thus process fewer nodes.

Delta introduces a significant potential for performance
improvement, especially for large node sequences and
computationally expensive payloads (Section 5). Figure 4
shows the corresponding XQuery user-defined function
delta(·,·) which, for Example 1.1 and thus Query Q1, can
serve as a drop-in replacement for function fix(·): line 14
then needs to be replaced by
return delta (rec ($seed), rec ($seed)),
so that res starts out as e_rec(e_seed), as in Figure 3(b).
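The difference in payload work between the two algorithms of Figure 3 can be made visible with a small Python sketch. Node sequences are modelled as frozensets; the prerequisite chain in `PREREQS` and the `fed` counter (number of items handed to the payload) are assumptions made for this illustration.

```python
from typing import Callable, FrozenSet, Tuple

Nodes = FrozenSet[str]

# Hypothetical prerequisite chain c1 -> c2 -> c3 -> c4 -> c5.
PREREQS = {"c1": {"c2"}, "c2": {"c3"}, "c3": {"c4"}, "c4": {"c5"}, "c5": set()}


def rec(courses: Nodes) -> Nodes:
    """Payload: direct prerequisites of the given courses."""
    return frozenset(p for c in courses for p in PREREQS[c])


def naive(seed: Nodes, body: Callable[[Nodes], Nodes]) -> Tuple[Nodes, int]:
    """Algorithm Naive (Figure 3(a)); also counts items fed into the payload."""
    fed = len(seed)
    res = body(seed)
    while True:
        fed += len(res)
        grown = body(res) | res
        if grown == res:                 # res did not grow
            return res, fed
        res = grown


def delta(seed: Nodes, body: Callable[[Nodes], Nodes]) -> Tuple[Nodes, int]:
    """Algorithm Delta (Figure 3(b)); feeds only newly found nodes."""
    fed = len(seed)
    res = body(seed)
    new = res
    while True:
        fed += len(new)
        new = body(new) - res            # nodes not encountered before
        if not new:                      # res did not grow
            return res, fed
        res = new | res


naive_res, naive_fed = naive(frozenset({"c1"}), rec)
delta_res, delta_fed = delta(frozenset({"c1"}), rec)
print(sorted(naive_res), naive_fed)
print(sorted(delta_res), delta_fed)
```

Both functions return the same set for this payload, but Naive re-feeds every previously found course in each round, so its item count grows quadratically with the length of the chain while Delta's grows linearly.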

Is this replacement of fix(·) by delta(·,·) always a valid
optimization? For XQuery, the answer is no.

Example 2.4 Consider the following expression:

    let $seed := (<a/>,<b><c><d/></c></b>)
    return with $x seeded by $seed
           recurse if (count($x/self::a))              (Q2)
                   then $x/* else ()

Let a, b, c, and d denote the tree fragments constructed by
the seed's subexpressions <a/>, <b><c><d/></c></b>,
<c><d/></c>, and <d/>, respectively. Thus, b/* is c and
c/* is d.

The table below illustrates the progress of the iterations
performed by algorithms Naive and Delta. While the former
computes (a,b,c,d), the latter returns (a,b,c).



             Naive         Delta
Iteration    res           res         Δ
    0        (a,b)         (a,b)       (a,b)
    1        (a,b,c)       (a,b,c)     (c)
    2        (a,b,c,d)     (a,b,c)     ()
    3        (a,b,c,d)
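The divergence shown in the table can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the four nodes are modelled as the strings "a" to "d" with a hand-written `CHILDREN` map, and both loops start from res = (a,b), as the first row of the table does.

```python
from typing import Callable, FrozenSet

Nodes = FrozenSet[str]

# Child relation of the two seed trees <a/> and <b><c><d/></c></b>.
CHILDREN = {"a": [], "b": ["c"], "c": ["d"], "d": []}


def body(xs: Nodes) -> Nodes:
    """Payload of Q2: if (count($x/self::a)) then $x/* else ()."""
    if "a" in xs:                        # inspects the whole sequence bound to $x
        return frozenset(ch for x in xs for ch in CHILDREN[x])
    return frozenset()


def naive_from(res: Nodes, rec: Callable[[Nodes], Nodes]) -> Nodes:
    """Naive loop, started from the given intermediate result."""
    while True:
        grown = rec(res) | res
        if grown == res:
            return res
        res = grown


def delta_from(res: Nodes, rec: Callable[[Nodes], Nodes]) -> Nodes:
    """Delta loop, started from the given intermediate result."""
    new = res
    while True:
        new = rec(new) - res
        if not new:
            return res
        res = new | res


start = frozenset({"a", "b"})
naive_result = naive_from(start, body)
delta_result = delta_from(start, body)
print(sorted(naive_result), sorted(delta_result))
```

Delta stops early because its second call sees only (c): without a in the input, the condition fails and d is never reached.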



Note how (3) resembles the distributivity property of
functions defined on sets. Such a function e is distributive
if, for all non-empty sets X, e(X) = ⋃_{x∈X} e({x}). This
property suggests a divide-and-conquer evaluation strategy
in which e is applied to subsets (singletons) of X only. We
define the corresponding distributivity property for XQuery
as follows:

Definition 3.1 (Distributivity Property for XQuery) Let e be
an XQuery expression in which variable $x may occur free.
Expression e is distributive for $x if, for any item sequence
X ≠ () and fresh variable $y,

    for $y in X return e($y)   =ₛ   e(X) .             (4)



What then is an effective characterization of those payloads
for which Naive may safely be traded for Delta?

3. Trading Naive for Delta 

We will now see that a simple notion of distributivity
for XQuery expressions suffices to let an XQuery processor
safely switch to a more efficient evaluation mode for
with $x seeded by e_seed recurse e_rec: whenever
expression e_rec is distributive (in the sense defined below),
algorithm Delta (Figure 3(b)) preserves the desired IFP
semantics. While the distributivity property is undecidable in
general, we present two safe and effective approximations
of distributivity, one formulated on the level of XQuery
language syntax, and one cast in terms of an algebraic XQuery
representation. The algebraic approximation will turn out
to be particularly simple and uniform (Section 4).

3.1. Distributivity in XQuery 

Obviously, Delta computes the IFP for given expressions
e_seed and e_rec if the algorithm produces the same result as
Naive on the same inputs. In particular, the algorithms are
equivalent if both yield equivalent intermediate result
sequences in each iteration of their do ··· while loops.

In its first loop iteration, Naive yields e_rec(e_rec(e_seed))
union e_rec(e_seed), which is equivalent to Delta's first
intermediate result (e_rec(e_rec(e_seed)) except e_rec(e_seed))
union e_rec(e_seed). For the second and further iterations, an
inductive proof can show the equivalence of all subsequent
intermediate result sequences, if we may assume that, for
two item sequences X_1, X_2, we have

    e_rec(X_1 union X_2)  =ₛ  e_rec(X_1) union e_rec(X_2) .    (3)

For lack of space, we do not reproduce the straightforward
equational reasoning behind the proof here but refer to [2].



In particular, Equality (3) is a straightforward consequence
if we know that the recursion body e_rec is distributive for its
free variable. Overall, we arrive at the following sufficient
condition for the applicability of Delta:

Theorem 3.2 Consider the expression with $x seeded
by e_seed recurse e_rec. If e_rec is distributive for $x, then
algorithm Delta computes the IFP of e_rec given e_seed.

XPath Location Steps. XPath location steps are a prevalent
example of distributive expressions in XQuery. Any
expression of the form e($x) = $x/s is distributive for
$x if the step subexpression s neither contains (i) free
occurrences of $x, nor (ii) calls to fn:position() and
fn:last() that refer to the context item sequence bound
to $x, nor (iii) node constructors. To see this, note that
the XQuery Core equivalent [9] of $x/s is fs:ddo(for
$fs:dot in $x return s), and then rewrite the lhs of
Equation (4) into its rhs, using the definition of =ₛ.



Regular XPath. These observations about the distributivity
of XPath location steps extend to Regular XPath [25] and
thus also make this XPath extension susceptible to Delta-based
evaluation. Since any Regular XPath step subexpression
s is of the form prescribed by (i) to (iii) above and Regular
XPath's transitive closure s⁺ is equivalently expressed
as with $x seeded by . recurse $x/s (for the simple
proof see [2]), Theorem 3.2 asserts that we may indeed
use algorithm Delta to evaluate s⁺.

In contrast, expression e($x) = $x[1] is not distributive
for $x in general. With variable $x bound to the sequence
(<a/>,<b/>), $x[1] evaluates to <a/>, while for $y in
$x return $y[1] yields (<a/>,<b/>). Effectively, this
invalidates Equation (4).
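Equation (4) can be probed on concrete inputs with a few lines of Python. This is only an empirical spot check under stated assumptions: sequences are Python lists, set-equality is modelled by comparing `set` values, and the `CHILDREN` map and helper names are hypothetical. A passing check on one input does not prove distributivity; a failing one refutes it.

```python
from typing import Callable, Dict, List

Seq = List[str]

# Hypothetical child relation used to mimic the location step $x/*.
CHILDREN: Dict[str, Seq] = {"a": ["a1", "a2"], "b": ["b1"]}


def child_step(xs: Seq) -> Seq:
    """Mimics e($x) = $x/*: evaluated item by item."""
    return [c for x in xs for c in CHILDREN.get(x, [])]


def first_item(xs: Seq) -> Seq:
    """Mimics e($x) = $x[1]: inspects the sequence as a whole."""
    return xs[:1]


def holds_equation_4(e: Callable[[Seq], Seq], xs: Seq) -> bool:
    """Compare e(X) with `for $y in X return e($y)` up to set-equality."""
    whole = set(e(list(xs)))
    piecewise = set()
    for y in xs:
        piecewise |= set(e([y]))
    return whole == piecewise


step_ok = holds_equation_4(child_step, ["a", "b"])
first_ok = holds_equation_4(first_item, ["a", "b"])
print(step_ok, first_ok)
```

For the input ("a", "b"), the step payload agrees on both sides, whereas `first_item` yields {"a"} on the whole sequence and {"a", "b"} item by item, which is the counterexample given in the text.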



3.2. Is Expression e_rec Distributive?
(A Syntactic Approximation)

Whenever an XQuery processor plans the evaluation of
with $x seeded by e_seed recurse e_rec, knowing the
answer to "Is e_rec distributive for $x?" is particularly
valuable: we may legitimately expect Delta to be a significantly
more efficient IFP evaluation strategy than Naive (Section 5
will indeed make this evident). While, unfortunately, there
is no complete procedure to decide this question,² still we
can safely approximate the answer. Here, we will present
purely syntactic, sufficient conditions for XQuery
distributivity. Section 4 approaches the same challenge on an
algebraic level.

Intuitively, we may not apply a divide-and-conquer
evaluation strategy for an expression e($x), if any subexpression
of e inspects the sequence bound to $x as a whole: e
is only evaluated after $x has been divided into individual
items (see Equation (4)). Obvious examples of such
problematic subexpressions are count($x) and $x[1], but also
the general comparison $x = 10 (that involves existential
quantification over the sequence bound to $x).

Subexpressions whose value is independent of $x, on
the other hand, are distributive. The only exception to this
rule are XQuery's node constructors, e.g., text {·}, which
create new node identities upon each invocation. With $x
bound to (<a/>,<b/>), for example,

    text{"c"}   ≠ₛ   for $y in $x return text{"c"} ,

since the rhs will yield a sequence of two distinct text nodes.

The inference rules of Figure 5 have been designed to
implement these considerations. The rules syntactically
assess the distributivity safety ds$x(e) of an arbitrary
LiXQuery [19] input expression e by traversing e's parse tree
in a bottom-up fashion. LiXQuery is a sublanguage of
XQuery that preserves Turing-completeness, removes all
but the most basic types, and excludes selected, rather
esoteric, language features. LiXQuery's simplification of the
verbose XQuery syntax and semantics has been designed
to make LiXQuery ideal for investigations of interesting
language properties, yet allow findings to be transposed to
full XQuery.

Rules FOR1 and FOR2 ensure that the recursion variable
$x occurs either in the body e_2 or in the range expression
e_1 of a for-iteration but not both. This coincides with the
linearity constraint of SQL:1999. A similar remark applies
to Rules STEP1 and STEP2 (in XQuery, the step operator
'/' essentially describes an iteration over a sequence of type
node()* [9]). Also note how Rule FUNCALL recursively
infers the distributivity of the body of a called function if the
recursion variable occurs free in the function argument(s).

In our context, whenever the XQuery processor is able
to infer ds$x(e) for an input expression e, then it is
guaranteed that e is indeed distributive for $x. The proof of this
implication, by induction on the syntactical structure of e,
is to be found in [2].
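A much reduced version of such a syntactic check can be sketched in Python. The sketch is an assumption-laden illustration, not the rule set of Figure 5: it covers only a toy expression language (variables, constants, count, step, for), the class and function names are invented, and the rule for `Count` (safe only if $x does not occur in its argument) is our reading of the surrounding text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: object


@dataclass(frozen=True)
class Count:            # count(e): inspects its argument as a whole
    arg: "Expr"


@dataclass(frozen=True)
class Step:             # e1/e2
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class For:              # for $var in rng return body
    var: str
    rng: "Expr"
    body: "Expr"


Expr = Union[Var, Const, Count, Step, For]


def fv(e: Expr) -> FrozenSet[str]:
    """Free variables of e."""
    if isinstance(e, Var):
        return frozenset({e.name})
    if isinstance(e, Const):
        return frozenset()
    if isinstance(e, Count):
        return fv(e.arg)
    if isinstance(e, Step):
        return fv(e.left) | fv(e.right)
    return fv(e.rng) | (fv(e.body) - {e.var})


def ds(x: str, e: Expr) -> bool:
    """Sufficient (not necessary) syntactic test for distributivity in $x."""
    if isinstance(e, (Var, Const)):
        return True
    if isinstance(e, Count):
        return x not in fv(e.arg)
    first, second = (e.left, e.right) if isinstance(e, Step) else (e.rng, e.body)
    # $x may occur on one side only (cf. FOR1/FOR2 and STEP1/STEP2).
    return (x not in fv(first) and ds(x, second)) or (
        ds(x, first) and x not in fv(second)
    )


step_safe = ds("x", Step(Var("x"), Const("child::*")))
count_safe = ds("x", Count(Var("x")))
hint_safe = ds("x", For("y", Var("x"), Count(Var("y"))))
print(step_safe, count_safe, hint_safe)
```

The last call shows the effect of wrapping a body in `for $y in $x return ...`: count($x) is rejected, while the iteration over $x whose body only mentions $y is accepted.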

Distributivity Hints. Still, the inference rules of Figure 5
can only check sufficient syntactical conditions for
distributivity to hold. The processor might thus actually
miss distributive expressions and will fail to infer
ds$x(count($x) >= 1), for example. However, it is
interesting to note that we can support the XQuery processor
in its distributivity assessment, since every distributive
expression is equivalent to a distributivity-safe expression:

If expression e($x) is distributive for $x, then it is
set-equal to for $y in $x return e($y), for which
the rules of Figure 5 will successfully infer
distributivity safety ds$x(·).



² If, for two arbitrary expressions e_1, e_2 in which $x does not occur free,
an XQuery processor could assess whether if (deep-equal(e_1, e_2))
then $x else $x[1] is distributive for $x, it could also decide the
equivalence of e_1 and e_2 (which is impossible).



This is a direct consequence of Rule FOR2 (Figure 5) and
Definition 3.1. Thus, at the expense of a slight query
reformulation, we may provide a "syntactic distributivity hint" to
the XQuery processor which effectively paves the way for
IFP evaluation via algorithm Delta.

4. Distributivity and Relational XQuery 

In this section we will, literally, follow an alternative
route to decide the applicability of Delta for the evaluation
of the IFP of an XQuery expression e_rec. We leave syntax
aside and instead inspect relational algebraic code that has
been compiled for e_rec: the equivalent algebraic
representation of e_rec renders the check for the inherently algebraic
distributivity property particularly uniform and simple.

Relational XQuery. This alternative route is inspired by
the Pathfinder project,³ which fully implements such a
purely relational approach to XQuery. Pathfinder compiles
instances of the XQuery Data Model (XDM) and XQuery
expressions into relational tables and algebraic plans over
these tables, respectively, and thus follows the dashed path
in Figure 6. The translation strategy built into the compiler
has been carefully designed (i) to faithfully preserve the
XQuery semantics (including compositionality, node
identity, iteration and sequence order), and (ii) to yield relational
plans which exclusively rely on regular relational query
engine technology (no specific operators or index structures
are required, in particular) [15].



³ http://www.pathfinder-xquery.org/
    ---------- (VAR)
    ds$x($v)

    $x ∉ fv(e_1)    ds$x(e_2)    ds$x(e_3)
    ------------------------------------------ (IF)
    ds$x(if (e_1) then e_2 else e_3)

    $x ∉ fv(e_1)    ds$x(e_2)
    ------------------------------------------ (FOR1)
    ds$x(for $v at $p in e_1 return e_2)

    ds$x(e_1)    $x ∉ fv(e_2)
    ------------------------------------------ (FOR2)
    ds$x(for $v in e_1 return e_2)

    $x ∉ fv(e_1)    ds$x(e_2)
    ------------------------------------------ (LET)
    ds$x(let $v := e_1 return e_2)

    $x ∉ fv(e_1)    (ds$x(c_i)) i=1...n+1
    ------------------------------------------ (TYPESW)
    ds$x(typeswitch (e_1)
           case T_1 return c_1
           ...
           case T_n return c_n
           default return c_{n+1})

    $x ∉ fv(e_1)    ds$x(e_2)
    ------------------------------------------ (STEP1)
    ds$x(e_1/e_2)

    ds$x(e_1)    $x ∉ fv(e_2)
    ------------------------------------------ (STEP2)
    ds$x(e_1/e_2)

    declare function f($v_1,...,$v_n) {e_0}
    ($x ∈ fv(e_i)  ⇒  ds$x(e_i) ∧ ds$v_i(e_0)) i=1...n
    ------------------------------------------ (FUNCALL)
    ds$x(f(e_1,...,e_n))

Figure 5. Distributivity-safety ds$x(·): A syntactic approximation of the distributivity property for
LiXQuery expressions.



[Figure 6 diagram: XQuery maps XDM instances to XDM instances; relational algebra maps tables to tables.]

Figure 6. Relational XQuery (dashed path)
faithfully implements the XQuery semantics.



The compiler emits a dialect of relational algebra that
mimics the capabilities of modern SQL query engines
(Table 1). Note that the non-textbook operators, like ε or ⊿,
merely are macros representing "micro plans" composed of
standard relational operators: expanding ⊿_{α::n} reveals ⋈_p,
where p is a conjunctive range predicate that realizes the
semantics of an XPath location step along axis α with node
test n, for example. The row numbering operator ϱ directly
compares with SQL:1999's ROW_NUMBER. The plans operate
over relational encodings of XQuery item sequences held
in flat (1NF) tables with schema iter|pos|item. In these
tables, columns iter and pos are used to properly reflect
for-iteration and sequence order, respectively. Column item
carries encodings of XQuery items, i.e., atomic values or
nodes.

Further details of Relational XQuery do not affect our 
present discussion of distributivity or IFP evaluation and 
may be found in |15|. In the following, let /e^ denote the 
algebraic plan that has been compiled for XQuery expres- 
sion e. 
Semantics                                          Push?
project onto columns a_i, rename b_i into a_i      yes
select rows with column b = true
join with predicate p                              yes
Cartesian product                                  yes
duplicate elimination (DISTINCT)                   no
union                                              yes
disjoint difference (EXCEPT ALL)                   no
aggregates (group by b, result in a)               no
n-ary arithmetic/comparison operator ∘
unique row tagging (tag in a)
ordered row numbering (by b_1, ..., b_n)           no
XPath step join (axis α, node test n)
node constructors                                  no
fixpoint operators

Table 1. Relational algebra dialect emitted by
the Pathfinder compiler.



4.1. Is Expression e_rec Distributive?
(An Algebraic Account)

An occurrence of the new with $x seeded by e_seed
recurse e_rec form in a source XQuery expression will be
compiled into a plan fragment as shown here on the right.
Operator μ, the algebraic representation of algorithm Naive
(Figure 3(a)), iterates the evaluation of the algebraic plan for
e_rec and feeds its output back to its input until the IFP is
reached. If we can guarantee that the plan for e_rec is
distributive, we may safely trade μ for its Delta-based
variant μ^Δ
(a) Is e_rec distributive?

(b) Taking a big step: Pushing

Figure 7. Algebraic distributivity assessment.
closed plan fragments with single entry and exit points
(outlined in Figure 9). These templates embody algebraic
implementations of basic XQuery constructs, e.g.,
the semantics of for-iteration or XPath location steps.
Assessing the distributivity of such plan templates is a one-time
effort. Once this has been done, whenever a distributive
template is encountered, the ∪ push-up process may disregard
the template's contents and instead perform a single
big step across the template (see Figure 7(b)).

For the XQuery processor, this suggests the following simple
procedure as a replacement for ds$x(·) (Section 3.2) to
assess the distributivity of e_rec:

    Start with the algebraic plan for e_rec with its input
      replaced by a union ∪ of two inputs;
    while not all ∪ have reached the plan's output do
      Perform a big step or push ∪ up through its parent
      operator, if possible. Otherwise return false;
    return true;
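The push-up procedure can be approximated by a recursive walk over a plan tree, sketched here in Python. All names are hypothetical, and two assumptions are built in: the set of blocking operator kinds follows the operators named in the text and in Table 1, and a binary operator other than union is accepted only if at most one of its inputs depends on the recursion input (mirroring the linearity remark of Section 3.2).

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: operator kinds through which a union may not be pushed.
BLOCKING = {"distinct", "difference", "aggregate", "rownum", "constructor"}


@dataclass
class Op:
    kind: str                                   # "input" marks the recursion input
    inputs: List["Op"] = field(default_factory=list)


def depends_on_input(op: Op) -> bool:
    """True if the recursion input is reachable below this operator."""
    return op.kind == "input" or any(depends_on_input(i) for i in op.inputs)


def union_pushable(op: Op) -> bool:
    """True if a union placed at the recursion input can reach this operator's output."""
    if op.kind == "input":
        return True
    dependent = [i for i in op.inputs if depends_on_input(i)]
    if not dependent:
        return True                             # no union ever arrives here
    if op.kind in BLOCKING:
        return False
    if op.kind != "union" and len(dependent) > 1:
        return False                            # union would arrive on both sides
    return all(union_pushable(i) for i in dependent)


# Plan shaped like the body of Q1: projection, step, projection.
q1_like = Op("project", [Op("stepjoin", [Op("project", [Op("input")])])])
# Plan with an aggregate over the recursion input, as in Q2.
q2_like = Op("select", [Op("aggregate", [Op("project", [Op("input")])])])

q1_ok = union_pushable(q1_like)
q2_ok = union_pushable(q2_like)
print(q1_ok, q2_ok)
```

Plan templates with a known distributivity status could be modelled as a single `Op` of a non-blocking kind, which corresponds to the big step described above.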



which, in general, will feed significantly fewer items back in
each iteration (see Figure 3(b) and Section 5).

In this algebraic setting, if the recursion body e_rec is
distributive, its relational plan will satisfy the equality shown
in Figure 7(a). This equality is the algebraic expression
of a divide-and-conquer evaluation strategy for e_rec
(Section 3.1): evaluating e_rec over a composite input (lhs, ∪)
yields the same result as the union of the evaluation of e_rec
over a partitioned input (rhs). Effectively, the union operator
∪ has been completely pushed up through all branches of
the DAG-shaped algebraic plan for e_rec. Zooming in from
the plan to the operator level, Figure 8 depicts how ∪ is
pushed up through unary and binary operators. Column
'Push?' of Table 1 indicates whether ∪ may indeed be
validly pushed through a given operator. Note that this push
through is prohibited by exactly those operators that need
to consume their complete input to produce the result. This
affects, e.g., aggregates, difference, and the row numbering
operator. As before, the occurrence of node constructors
renders e_rec non-distributive.

Because our primary goal is distributivity assessment (as
opposed to query evaluation, but see Section 5), we may
actually employ simplified variants of e_rec in this context.
In particular, since the definition of distributivity disregards
duplicates and order (Definition 3.1), the compiler may
choose to remove code from e_rec that is used to eliminate
duplicate nodes after XPath location steps as well as omit
those parts of the plan that realize the proper XQuery order
semantics [14].

Further, the plans generated by the XQuery compiler 
typically contain numerous instantiations of plan templates. 



Figure 9 depicts the algebraic representations of the recursion
bodies of Queries Q1 and Q2 (Section 2). For
Query Q1, to push ∪ through from the plan's input to its
output, the distributivity check will succeed after it has
performed two steps across the two peripheral projections plus
one intermediate big step across the for-iteration that
implements the semantics of the $x/id(·) lookup. For Query Q2,
∪ will be pushed through π_{iter,item} and then up the two
branches of the DAG-shaped plan. In the right branch, the
aggregate count_{item/iter} blocks the process (Table 1), which
indicates that the processor may not use algorithm Delta (or
the μ^Δ variant of the fixed point operator) to evaluate Query Q2.



Algebraic vs. Syntactic Approximation. Compared to 
the syntactic approximation ds$x(·), this algebraic account of
distributivity draws its conciseness from the fact that the 
rather involved XQuery semantics and substantial number 
of built-in functions nevertheless map to a small number of 
algebraic primitives (given suitable relational encodings of 
the XDM). Further, for these primitives, the algebraic dis- 
tributivity property is readily decided. 

To make this point, consider this slight yet equivalent
variation of Query Q1 in which variable $x now occurs free
in the argument of function id(·):

    with $x seeded by doc("curriculum.xml")
                      /course[@code="c1"]
    recurse id($x/prerequisites/pre_code)

If we unfold the implementation of the XQuery built-in
function id(·) (effectively, this expansion is performed
when Rule FUNCALL recursively invokes ds$x(·) to assess


declare variable $doc := doc("auction.xml");
Figure 9. Relational representations of the recursion
bodies e_rec of Queries Q1 and Q2.



the distributivity of the function body of id(·)), we obtain

    with $x seeded by doc("curriculum.xml")
                      /course[@code="c1"]
    recurse
      for $c in doc("curriculum.xml")/course
      where $c/@code = $x/prerequisites/pre_code
      return $c

The syntactic approximation will flag the recursion body
as non-distributive because of the general comparison (=)
in the where clause (Section 3.2). While the algebraic
approach would be unaffected by the variation, the rule set of
Figure 5 would need a specific rule for id(·) to be able to
infer its actual distributivity.

5. Practical Impact of Distributivity and Delta 

Recasting a recursive XQuery query as an inflationary 
fixed point computation imposes restrictions. Such recast- 
ing, however, also puts the query processor into control 
since the applicability of a promising optimization, trading 
Naive for Delta, becomes effectively decidable. This sec- 
tion provides the evidence that significant gains can indeed 
be realized, much like in the relational domain. 

To quantify the impact, we implemented the two 
fixed point operator variants ii and /i'^ (Section 14. Il l 
in MonetDB/XQuery 0.18 [81, an efficient and scalable 
XQuery processor that consequently implements the Re- 
lational XQuery approach (Section |4]l. Its algebraic com- 
piler front-end Pathfinder has been enhanced (i) to pro- 
cess the syntactic form with • • • seeded by ■ • • recurse. 



declare function bidder ($in as nodeO*) as nodeO* 
{ for $id in $in/@id 

let $b := $doc//open_auction[seller/@person = $id] 
/bidder /per sonref 

return $doc//people/person[@id = $b/(§person] 
}; 

for $p in $doc//people/person 
return <person> 

{ $p/@id y 

{ data ( (with $x seeded by $p 

recurse bidder ($x) )/@id) } 
</person> 

Figure 10. XMark bidder network query. 



and (ii) to implement the algebraic distributivity check. All 
queries in this section were recognized as being distributive 
by Pathfinder. To demonstrate that any XQuery processor 
can benefit from optimized IFP evaluation in the presence of 
distributivity, we also performed the transition from Naive 
to Delta on the XQuery source level and let Saxon-SA 8.9 
fSOl process the resulting user-defined recursive queries (cf. 
Figures 2 and 4). All experiments were conducted on a
Linux-based host (64 bit), with two 3.2 GHz Intel Xeon® 
CPUs, 8 GB of primary and 280 GB SCSI disk-based sec- 
ondary memory. 

Table 2 summarizes our observations for four query
types, chosen to inspect the systems' behavior for growing 
input XML instance sizes and varying result sizes at each re- 
cursion level (the maximum recursion depth ranged from 5 
to 33). 



XMark Bidder Network. To assess scalability, we computed
a bidder network, recursively connecting the sellers
and bidders of auctions (Figure 10), over XMark [24]
XML data of increasing size (from scale factor 0.01, small,
to 0.33, huge). If Delta is used to compute the IFP of this
network, MonetDB/XQuery (2.2 to 3.3 times faster) as well
as Saxon (1.2 to 2.7 times faster) benefit significantly. Most
importantly, note that the number of nodes in the network
grows quadratically with the input document size. Algorithm
Delta feeds significantly fewer nodes back in each recursion
level, which positively impacts the complexity of
the value-based join inside recursion payload bidder(·):
for the huge network, Delta feeds exactly those 10 million
nodes into bidder(·) that make up the result, whereas Naive
repeatedly revisits intermediate results and processes 9 times
as many nodes.
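
The difference in the number of nodes fed back can be reproduced on a toy input. The following Python sketch assumes the usual textbook formulations of Naive and Delta over sets; the chain-shaped graph, the function names, and the counter are illustrative assumptions and do not model the XMark data or either processor.

```python
def naive(seed, body):
    """Naive: feed the entire accumulated result back in every round."""
    res = set(body(seed))
    fed_back = 0
    while True:
        fed_back += len(res)          # the whole result is fed back
        new = res | body(res)
        if new == res:                # result did not grow: fixed point
            return res, fed_back
        res = new

def delta(seed, body):
    """Delta: feed back only the nodes discovered in the previous round."""
    res = set(body(seed))
    d = set(res)
    fed_back = 0
    while d:
        fed_back += len(d)            # only new nodes are fed back
        d = body(d) - res
        res |= d
    return res, fed_back

# Hypothetical chain 0 -> 1 -> ... -> 5; the body follows one edge per node.
EDGES = {i: {i + 1} for i in range(5)}

def successors(xs):
    return {s for x in xs for s in EDGES.get(x, ())}

print(naive({0}, successors))  # ({1, 2, 3, 4, 5}, 15)
print(delta({0}, successors))  # ({1, 2, 3, 4, 5}, 5)
```

Both variants return the same result because the body is distributive; Delta feeds each result node back exactly once, while Naive feeds back a growing prefix in every round.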



                         MonetDB/XQuery          Saxon-SA 8.9            Total # of Nodes Fed Back Recursion
Query                    Naive       Delta       Naive       Delta       Naive        Delta        Depth
Bidder network (small)   362 ms      165 ms      2,307 ms    1,872 ms    40,254       9,319        10
Bidder network (medium)  5,010 ms    1,995 ms    15,027 ms   7,284 ms    683,225      122,532      16
Bidder network (large)   40,785 ms   13,805 ms   123,316 ms  52,436 ms   5,694,390    961,356      15
Bidder network (huge)    9 m 46 s    176,890 ms  32 m 40 s   12 m 04 s   87,528,919   9,799,342    24
Romeo and Juliet         6,795 ms    1,260 ms    1,150 ms    818 ms      37,841       5,638        33
Curriculum (medium)      183 ms      135 ms      1,308 ms    1,040 ms    12,301       3,044        18
Curriculum (large)       1,466 ms    646 ms      3,485 ms    2,176 ms    127,992      19,780       35
Hospital (medium)        734 ms      497 ms      1,301 ms    1,290 ms    99,381       50,000       5

Table 2. Naive vs. Delta: Comparison of query evaluation times and total number of nodes fed back.

Romeo and Juliet Dialogs. Far fewer nodes are processed
by a recursive expression that queries XML markup of
Shakespeare's Romeo and Juliet¹ to determine the maximum
length of any uninterrupted dialog. Seeded with
SPEECH element nodes, each level of the recursion expands
the currently considered dialog sequences by a single
SPEECH node given that the associated SPEAKERs are
found to alternate (horizontal structural recursion along
the following-sibling axis; we do not reproduce the
query here for space reasons). Although the recursion is
shallow (depth 6 on average), Table 2 shows how both
MonetDB/XQuery and Saxon completed evaluation up to
5 times faster because the query had been specified in a
distributive fashion.



Transitive Closures. Two more queries, taken directly
from related work [22, 11], compute transitive closure problems
(we generated the data instances with the help of ToXgene 1^).
The first query implements a consistency check
over the curriculum data (cf. Figure 1) and finds courses that
are among their own prerequisites (Rule 5 in the Curriculum
Case Study in Appendix B of [22]). Much like for the bidder
network query, the larger the query input (medium instance:
800 courses, large: 4,000 courses), the better MonetDB/XQuery
and Saxon exploited Delta.
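
The shape of this consistency check can be sketched as follows. This is a minimal Python illustration under stated assumptions: the prerequisite relation is a hypothetical in-memory dictionary rather than the curriculum XML instance, and the closure is computed with a Delta-style iteration over sets.

```python
def prerequisites_closure(course, prereq):
    """All direct and indirect prerequisites of `course` (Delta-style)."""
    res = set(prereq.get(course, ()))
    d = set(res)
    while d:
        # Follow one more prerequisite step, keeping only unseen courses.
        d = {p for c in d for p in prereq.get(c, ())} - res
        res |= d
    return res

def self_prerequisite_courses(prereq):
    """Courses that occur among their own (transitive) prerequisites."""
    return sorted(c for c in prereq if c in prerequisites_closure(c, prereq))

# Hypothetical data: c1 -> c2 -> c3 -> c1 forms a cycle, c4 only leads into it.
PREREQ = {"c1": ["c2"], "c2": ["c3"], "c3": ["c1"], "c4": ["c1"], "c5": []}

print(self_prerequisite_courses(PREREQ))  # ['c1', 'c2', 'c3']
```

Course c4 depends on the cycle but is not part of it, so it is not reported.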

The last query in the experiment explores 50,000 hospital
patient records to investigate a hereditary disease [11]. In
this case, the recursion follows the hierarchical structure of 
the XML input (from patient to parents), recursing into sub- 
trees of a maximum depth of 5. Again, Delta makes a no- 
table difference even for this computationally rather "light" 
query. 

We believe that this renders this particular controlled 
form of XQuery recursion and its associated distributivity 
notion attractive, even for processors that do not implement 
a dedicated fixed point operator (like Saxon). 



¹ http://www.ibiblio.org/xml/examples/shakespeare/



6. More Related Work 

Bringing adequate support for recursion to XQuery is 
a core research matter on various levels of the language. 
While the efficient evaluation of the recursive XPath axes 
(e.g., descendant or ancestor) is well understood by now
[13, 16], the optimization of recursive user-defined functions
has been found to be tractable only in the presence of restrictions:
[23, 13] propose exhaustive inlining of functions
but require that functions are structurally recursive (use 
axes child and descendant to navigate into subtrees only) 
over acyclic schemata to guarantee that inlining terminates. 
Note that, beyond inlining, this type of recursion does not 
come packaged with an effective optimization hook compa- 
rable to what the inflationary fixed point offers. 

The distinguished use case for inflationary fixed point 
computation is transitive closure. This is also reflected by 
the advent of XPath dialects like Regular XPath |l23 and the 
inclusion of a dedicated dyn:closure(·) construct in the
EXSLT function library [10]. We have seen applications in
Section 5 [22, 11], and recent work on data integration and
XML views adds to this ifTH .

In the domain of relational query languages, Naive is the
most widely described algorithmic account of the inflationary
fixed point operator [5]. Its optimized Delta variant, in
focus since the 1980's, has been coined delta iteration iflTll,
semi-naive H, or wavefront [18] strategy in earlier work.

Since our work rests on the adaptation of these original
ideas to the XQuery Data Model and language, the large
"relational body" of work in this area should be directly
transferable, even more so in the Relational XQuery context.
In particular, optimization techniques like Magic Set
rewriting [4] should apply (this has not been investigated in
the present paper).

The adoption of inflationary fixed point semantics by Datalog
and SQL:1999 with its WITH RECURSIVE clause (Section 2)
led to investigations of the applicability of Delta
for these recursive relational query languages. For stratified
Datalog programs [1], Delta is applicable in all cases:
positive Datalog maps onto the distributive operators of relational
algebra (π, σ, ⋈, ∪, ∩) while stratification yields
partial applications of the difference operator x \ R in which
R is fixed (f(x) = x \ R is distributive).
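
The claim about the difference operator can be checked by brute force on a small universe. The Python sketch below is an illustration only: it tests distribution over binary unions of finite sets, and the universe, the fixed set R, and the helper names are assumptions made for the example.

```python
from itertools import chain, combinations

def subsets(universe):
    """All subsets of a small finite universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))]

def distributes_over_union(f, universe):
    """Check f(x | y) == f(x) | f(y) for all pairs of subsets."""
    subs = subsets(universe)
    return all(f(x | y) == f(x) | f(y) for x in subs for y in subs)

U = {1, 2, 3, 4}
R = frozenset({2, 3})

print(distributes_over_union(lambda x: x - R, U))  # True
print(distributes_over_union(lambda x: R - x, U))  # False
```

With R fixed as the right operand the difference distributes; with the recursion variable as the right operand it does not, since R \ (x ∪ y) is an intersection of R \ x and R \ y rather than their union.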

SQL:1999, on the other hand, imposes rigid syntactical
restrictions [21] on the iterative fullselect (recursion
body) inside WITH RECURSIVE that make Delta applicable:
grouping, ordering, usage of column functions (aggregates),
and nested subqueries are ruled out, as are repeated references
to the virtual table computed by the recursion. Replacing
this coarse syntactic check by an algebraic distributivity
assessment (Section 4) would render a larger class of
queries admissible for efficient fixed point computation.
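
For readers who want to see the relational counterpart in action, the following Python sketch runs a WITH RECURSIVE query through the standard-library sqlite3 module. SQLite is our assumption here and is not one of the systems discussed in this paper; the table name prereq, its columns, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prereq (course TEXT, pre_code TEXT);
    INSERT INTO prereq VALUES
        ('c1', 'c2'), ('c2', 'c3'), ('c3', 'c4'), ('c5', 'c1');
""")

# Seed: direct prerequisites of c1. Recursive part: one more join step.
# UNION (not UNION ALL) discards duplicates, so the iteration terminates.
rows = conn.execute("""
    WITH RECURSIVE closure(code) AS (
        SELECT pre_code FROM prereq WHERE course = 'c1'
        UNION
        SELECT p.pre_code
        FROM prereq AS p JOIN closure AS c ON p.course = c.code
    )
    SELECT code FROM closure ORDER BY code
""").fetchall()

codes = [code for (code,) in rows]
print(codes)  # ['c2', 'c3', 'c4']
```

The recursive part references the virtual table closure exactly once and uses neither grouping nor aggregates, in line with the restrictions listed above.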

7. Wrap-Up 

This paper may be read in two ways: 

(i) As a proposal to add an inflationary fixed point construct,
along the lines of with ··· seeded by ··· recurse ···,
to XQuery (this has actually been discussed by the
W3C XQuery working group in the very early XQuery days
of 2001 but then dismissed because the group aimed for a
first-order language design at that time).

(ii) As a guideline for query authors as well as XQuery
processor designers to check for and then exploit distributivity
during the evaluation of recursive queries.

We have seen how such distributivity checks can be used
to safely unlock the optimization potential, namely algorithm
Delta, that comes tightly coupled with the inflationary
fixed point semantics. MonetDB/XQuery implements
this distributivity check on the algebraic level and significantly
benefits whenever the Delta-based operator μ^Δ may
be used for fixpoint computation. Even if the approach is
realized on the coarser syntactic level on top of an existing
XQuery processor, feeding back fewer nodes in each recursion
level yields substantial performance improvements.

Remember that the distributivity notion suggests a divide- 
and-conquer evaluation strategy (Section lsTTT l in which parts 
of a computation may be performed independently (before 
a merge step forms the final result). Beyond recursion, 
this may lead to improved XQuery compilation strategies 
for back-ends that can exploit such independence, e.g., set-
oriented relational query processors (cf. loop-lifting [15]) as
well as parallel or distributed execution platforms. 
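
The divide-and-conquer reading of distributivity can be sketched in a few lines. The Python example below is a hedged illustration, not a description of any back-end mentioned here: the partitioning scheme, the thread pool, and the sample body are assumptions, and the equality only holds because the chosen body is distributive.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_partitioned(body, nodes, parts=4):
    """Apply a distributive body to independent partitions, then merge."""
    items = sorted(nodes)
    chunks = [items[i::parts] for i in range(parts)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial = list(pool.map(lambda chunk: body(set(chunk)), chunks))
    return set().union(*partial)      # merge step forms the final result

def sample_body(xs):
    # Hypothetical distributive body: each input contributes independently.
    return {y for x in xs for y in (x + 1, 2 * x)}

nodes = set(range(10))
print(evaluate_partitioned(sample_body, nodes) == sample_body(nodes))  # True
```

For a non-distributive body, such as one that counts its input, the merged partial results would generally differ from the result on the whole input.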